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PRELIMINARY AMENDMENT 

This application is related to U.S. patent application serial number 
09/285,481, filed April 2, 1999, and entitled "Automated Process Line", which 
is referred to and incorporated herein in its entirety by this reference. 

Please replace the paragraphs beginning on page 2, line 4, through page 
3, line 12, with the following paragraphs: 

In a particularly exciting area of genomics, the identification and 
classification of minute variations in human DNA has been linked with 
fundamental treatment or medical advice for a specific individual. For example, 
the variations are a strong indication of predisposition for a particular disease, 
drug tolerance, and drug efficiency. The most promising of these minute 
variations are commonly referred to as Single Nucleotide Polymorphisms (SNPs), 
which relate to a single base-pair change between a first subject and a second 
subject. By accurately and fully identifying such SNPs, a health care provider 
would have a powerful indication of a person's likelihood of succumbing to a 
particular disease, which drugs will be most effective for that person, and what 
drug treatment plan will be most beneficial. Armed with such knowledge, the 
health care provider can assist a person in lowering other risk factors for high- 
susceptibility diseases. Further, the health care provider can confidently select 
appropriate drug therapies, a process which is now an iterative, hit or miss 
process where different drugs and treatment schedules are tried until an 
effective one is found. Not only is this a waste of limited medical resources, but 
the time lost in finding an effective therapy can have serious medical 
consequences for the patient. 

In order to fully benefit from the use of SNP data, vast quantities of DNA 
data must be collected, compared, and analyzed. For example, collecting and 
identifying the SNP profile for a single human subject requires the collection, 
identification, and classification of thousands, even tens of thousands of DNA 
samples. Further, the analysis of the resulting DNA data must be carried out 
with precision. In making a genetic call, where a composition of a biological 
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sample is identified, any error in the call may result in detrimentally affecting the 
medical advice or treatment given to a patient. 

Conventional, known systems and processes for collecting and analyzing 
DNA data are inadequate to timely and efficiently implement a widespread 
medical program benefiting from SNP information. For example, many known 
DNA analysis techniques require the use of an operator or technician to monitor 
and review the DNA data. An operator, even with sufficient training and 
substantial experience, is still likely to occasionally make a classification error. 
For example, the operator may incorrectly identify a base-pair, leading to that 
patient receiving faulty SNP profile. Alternatively, the operator may view the 
data and decide that the data do not clearly identify any particular base pair. 
Although such a "no call" may be warranted, it is likely that the operator will 
make "no-call" decisions when the data actually support a valid call. In such a 
manner, the opportunity to more fully profile the patient is lost. 

Please replace the paragraphs on page 3, line 22, through page 4, line 13, 
with the following paragraphs: 

Briefly, the method and system for identifying a biological sample 
generates a data set indicative of the composition of the biological sample. In a 
particular example, the data set is DNA spectrometry data received from a mass 
spectrometer. The data set is denoised, and a baseline is deleted. Since 
possible compositions of the biological sample may be known, expected peak 
areas may be determined. Using the expected peak areas, a residual baseline is 
generated to further correct the data set. Probable peaks are then identifiable in 
the corrected data set, which are used to identify the composition of the 
biological sample. In a disclosed example, statistical methods are employed to 
determine the probability that a probable peak is an actual peak, not an actual 
peak, or that the data are too inconclusive to call. 
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Advantageously, the method and system for identifying a biological 
sample accurately makes composition calls in a highly automated manner. In 
such a manner, complete SNP profile information, for example, may be collected 
efficiently. More importantly, the collected data are analyzed with highly 
accurate results. For example, when a particular composition is called, the 
result may be relied upon with great confidence. Such confidence is provided 
by the robust computational process employed and the highly automatic method 
of collecting, processing, and analyzing the data set. 

Please replace the paragraphs beginning on page 6, line 10, through page 
7, line 20, with the following paragraphs: 

In accordance with the present invention, a method and device for 
identifying a biological sample is provided. Referring now to FIG. 1, an 
apparatus 10 for identifying a biological sample is disclosed. The apparatus 10 
for identifying a biological sample generally comprises a mass spectrometer 15 
communicating with a computing device 20. In a preferred embodiment, the 
mass spectrometer may be a MALDI-TOF mass spectrometer manufactured by 
Bruker-Franzen Analytik GmbH; however, it will be appreciated that other mass 
spectrometers can be substituted. The computing device 20 is preferably a 
general purpose computing device. However, it will be appreciated that the 
computing device could be alternatively configured; for example, it may be 
integrated with the mass spectrometer or could be part of a computer in a larger 
network system. 

The apparatus 10 for identifying a biological sample may operate as an 
automated identification system having a robot 25 with a robotic arm 27 
configured to deliver a sample plate 29 into a receiving area 31 of the mass 
spectrometer 15. In such a manner, the sample to be identified may be placed 
on the plate 29 and automatically received into the mass spectrometer 15. The 
biological sample is then processed in the mass spectrometer to generate data 
indicative of the mass of DNA fragments into biological sample. These data 
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may be sent directly to computing device 20, or may have some preprocessing 
or filtering performed within the mass spectrometer. In a preferred embodiment, 
the mass spectrometer 1 5 transmits unprocessed and unfiltered mass 
spectrometry data to the computing device 20. However, it will be appreciated 
that the analysis in the computing device may be adjusted to accommodate 
preprocessing or filtering performed within the mass spectrometer. 

Referring now to FIG. 2, a general method 35 for identifying a biological 
sample is shown. In method 35, data are received into a computing device from 
a test instrument in block 40. Preferably the data are received in a raw, 
unprocessed and unfiltered form, but alternatively may have some form of 
filtering or processing applied. The test instrument of a preferred embodiment is 
a mass spectrometer as described above. However, it will be appreciated that 
other test instruments could be substituted for the mass spectrometer. 

The data generated by the test instrument, and in particular the mass 
spectrometer, include information indicative of the identification of the biological 
sample. More specifically, the data are indicative of the DNA composition of the 
biological sample. Typically, mass spectrometry data gathered from DNA 
samples obtained from DNA amplification techniques are noisier than, for 
example, those from typical protein samples. This is due in part because protein 
samples are more readily prepared in more abundance, and protein samples are 
more easily ionizable as compared to DNA samples. Accordingly, conventional 
mass spectrometer data analysis techniques are generally ineffective for DNA 
analysis of a biological sample. 

Please replace the paragraphs beginning on page 8, line 13, through page 
9, line 14 with the following paragraphs: 

Referring again to FIG. 2, the data received in block 40 are denoised in 
block 45. The denoised data then has a baseline correction applied in block 50. 
A baseline correction is generally necessary as data coming from the test 
instrument, in particular a mass spectrometer instrument, has data arranged in a 
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generally exponentially decaying manner. This generally exponential decaying 
arrangement is not due to the composition of the biological sample, but is a 
result of the physical properties and characteristics of the test instrument and 
other chemicals involved in DNA sample preparation. Accordingly, baseline 
correction substantially corrects the data to remove a component of the data 
attributable to the test system and sample preparation characteristics. 

After denoising in block 45 and the baseline correction in block 50, a 
signal remains which is generally indicative of the composition of the biological 
sample. However, due to the extraordinary discrimination required for analyzing 
the DNA composition of the biological sample, the composition is not readily 
apparent from the denoised and corrected signal. For example, although the 
signal may include peak areas, it is not yet clear whether these "putative" peaks 
actually represent a DNA composition, or whether the putative peaks are result 
of a systemic or chemical aberration. Further, any call of the composition of the 
biological sample would have a probability of error which would be unacceptable 
for clinical or therapeutic purposes. In such critical situations, there needs to be 
a high degree of certainty that any call or identification of the sample is 
accurate. Therefore, additional data processing and interpretation are necessary 
before the sample can be accurately and confidently identified. 

Since the quantity of data resulting from each mass spectrometry test is 
typically thousands of data points, and an automated system may be set to 
perform hundreds or even thousands of tests per hour, the quantity of mass 
spectrometry data generated is enormous. To facilitate efficient transmission 
and storage of the mass spectrometry data, block 55 shows that the denoised 
and baseline corrected data are compressed. 

Please replace the paragraph on page 10, lines 1-11, with the following 
paragraph: 

Referring again to block 40, data are received from the test instrument, 
which is preferably a mass spectrometer. In a specific illustration, FIG. 3 shows 
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an example of data from a mass spectrometer. The mass spectrometer data 70 
generally comprises data points distributed along an x-axis and a y-axis. The x- 
axis represents the mass of particles detected, while the y-axis represents a 
numerical concentration of the particles. As can be seen in FIG. 3, the mass 
spectrometry data 70 is generally exponentially decaying with data at the left 
end of the x-axis generally decaying in an exponential manner toward data at 
the heavier end of the x-axis. However, the general exponential presentation of 
the data is not indicative of the composition of the biological sample, but is 
more reflective of systematic error and characteristics. Further, as described 
above and illustrated in FIG. 3, considerable noise exists in the mass 
spectrometry DNA data 70. 

Please replace the paragraphs beginning on page 11, line 12, through 
page 12, line 3, with the following paragraphs: 

FIG. 5 shows an example of stage 0 high data 95. Since stage 0 high 
data 95 is generally indicative of the highest frequencies in the mass 
spectrometry data, stage 0 high data 95 will closely relate to the quantity of 
high frequency noise in the mass spectrometry data. In FIG. 6, an exponential 
fitting formula has been applied to the stage 0 high data 95 to generate a stage 
0 noise profile 97. In particular, the exponential fitting formula is in the format 
A 0 + A, EXP (-A 2 m). It will be appreciated that other exponential fitting 
formulas or other types of curve fits may be used. 

Referring now to FIG. 7, noise profiles for the other high stages are 
determined. Since the later data points in each stage will likely be 
representative of the level of noise in each stage, only the later data points in 
each stage are used to generate a standard deviation figure that is 
representative of the noise content in that particular stage. More particularly, in 
generating the noise profile for each remaining stage, only the last five percent 
of the data points in each stage are analyzed to determine a standard deviation 
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number. It will be appreciated that other numbers of points, or alternative 
methods could be used to generate such a standard deviation figure. 

Please replace the paragraph on page 14, lines 10-12, with the following 
paragraph: 

The formula is generally indicated in FIG. 10. Once the signal has been 
denoised and shifted, a denoised and shifted signal 130 is generated as shown 
in FIG. 12. FIG. 1 1 shows an example of the wavelet coefficient 135 data set 
from the denoised and shifted signal 1 30. 

Please replace the paragraph on page 15, lines 12-20, with the following 
paragraph: 

Referring again to FIG. 2, the data from the baseline correction 50 is now 
compressed in block 55; the compression technique used in a preferred 
embodiment is detailed in FIG. 18. In FIG. 18 the data in the baseline corrected 
data are presented in an array format 182 with x-axis points 183 having an 
associated data value 184. The x-axis is indexed by the non-zero wavelet 
coefficients, and the associated value is the value of the wavelet coefficient. In 
the illustrated data example in table 182, the maximum value 184 is indicated to 
be 1000. Although a particularly advantageous compression technique for mass 
spectrometry data is shown, it will be appreciated that other compression 
techniques can be used. Although not preferred, the data may also be stored 
without compression. 

Please replace the paragraphs on page 16, line 16, through page 17, line 
14, with the following paragraphs: 

FIG. 19 generally describes the method of compressing mass 
spectrometry data, showing that the data file in block 201 is presented as an 
array of coefficients in block 202. The data starting point and maximum is 
determined as shown in block 203, and the intermediate real numbers are 
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calculated in block 204 as described above. With the intermediate data points 
generated, the compressed data is generated in block 205. The described 
compression method is highly advantageous and efficient for compressing data 
sets such as a processed data set from a mass spectrometry instrument. The 
method is particularly useful for data, such as mass spectrometry data, that use 
large numbers and have been processed to have occasional lengthy gaps in x- 
axis data. Accordingly, an x-y data array for processed mass spectrometry data 
may be stored with an effective compression rate of 10x or more. Although the 
compression technique is applied to mass spectrometry data, it will be 
appreciated with the method may also advantageously be applied to other data 
sets. 

Referring again to FIG. 2, peak heights are now determined in block 60. 
The first step in determining peak height is illustrated in FIG. 20 where the 
signal 210 is shifted left or right to correspond with the position of expected 
peaks. As the set of possible compositions in the biological sample is known 
before the mass spectrometry data is generated, the possible positioning of 
expected peaks is already known. These possible peaks are referred to as 
expected peaks, such as expected peaks 212, 214, and 216. Due to calibration 
or other errors in the test instrument data, the entire signal may be shifted left 
or right from its actual position; therefore, putative peaks located in the signal, 
such as putative peaks 218, 222, and 224 may be compared to the expected 
peaks 212, 214, and 216, respectively. The entire signal is then shifted such 
that the putative peaks align more closely with the expected peaks. 

Please replace the paragraph on page 19, lines 3-14, with the following 
paragraph: 

An indication of the confidence that each putative peak is an actual peak 
can be discerned by calculating a signal-to-noise ratio for each putative peak. 
Accordingly, putative peaks with a strong signal-to-noise ratio are generally 
more likely to be an actual peak than a putative peak with a lower signal-to- 
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to have a final peak probability of 90%. Peak 325 is positioned such that it 
represents a G component in the biological sample. Accordingly, it can be 
maintained that there is a 90% probability that G exists in the biological sample. 
Also in the example shown in FIG. 31, the second highest probability is peak 
330 which has a peak probability of 20%. Peak 330 is at a position associated 
with a C composition. Accordingly, it can be maintained that there is a 20% 
probability that C exists in the biological sample. 

Please replace the paragraphs on page 23, line 10, through page 24, line 
18, with the following paragraphs: 

Once the probabilities of each of the possible combinations has been 
determined, FIG. 32 is used to decide whether or not sufficient confidence 
exists to call the genotype. FIG. 32 shows a call chart 335 which has an x-axis 
337 which is the ratio of the highest combination probability to the second 
highest combination probability. The y-axis 339 simply indicates whether the 
ratio is sufficiently high to justify calling the genotype. The value of the ratio 
may be indicated by M. The value of M is set depending upon trial data, sample 
composition, and the ability to accept error. For example, the value M may be 
set relatively high, such as to a value 4 so that the highest probability must be 
at least four times greater than the second highest probability before confidence 
is established to call a genotype. However, if a certain level of error may be 
acceptable, the value of M may be set to a more aggressive value, such as to 3, 
so that the ratio between the highest and second highest probabilities needs to 
be only a ratio of 3 or higher. Of course, moderate value may be selected for M 
when a moderate risk can be accepted. Using the example of FIG. 31, where 
the probability of GG was 72% and the probability of GC was 18%, the ratio 
between 72% and 18% is 4.0; therefore, whether M is set to 3, 3.5, or 4, the 
system would call the genotype as GG. Although the preferred embodiment 
uses a ratio between the two highest peak probabilities to determine if a 
genotype confidently can be called, it will be appreciated that other methods 
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